Fix stable CI test harness failures #1986
Conversation
🦋 Changeset detected

Latest commit: 5daac2c

The changes in this PR will be included in the next version bump. This PR includes changesets to release 23 packages.
🧪 E2E Test Results

❌ Some tests failed

Summary

❌ Failed Tests

🌍 Community Worlds (86 failed)
- mongodb (12 failed):
- redis (9 failed):
- turso (65 failed):
Details by Category

- ✅ ▲ Vercel Production
- ✅ 💻 Local Development
- ✅ 📦 Local Production
- ✅ 🐘 Local Postgres
- ✅ 🪟 Windows
- ❌ 🌍 Community Worlds
- ✅ 📋 Other
```diff
  expect(port).toBe(fastAddr.port);
  // Should complete reasonably quickly (Windows CI can be slow)
- expect(elapsed).toBeLessThan(2000);
+ expect(elapsed).toBeLessThan(5000);
```
5s is too long - what's actually causing the get port to take so long and can we make this faster?
Yep, that 5s bump was masking the wrong thing. The slow part was Windows `netstat`/process-port discovery, not the HTTP probe timeout itself.

I changed the test to pass explicit `candidatePorts`, so it bypasses OS discovery and measures the custom probe timeout directly. The assertion is back to less than 2000ms.
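A minimal sketch of that shape. Only `candidatePorts` is named in this thread; the `discoverPort` helper, its options, and the probe logic below are illustrative assumptions, not the actual harness code:

```ts
import { createServer } from "node:http";
import { expect, test } from "vitest";

// Hypothetical helper (name and options assumed): probe explicit ports over
// HTTP instead of discovering them via netstat/process inspection.
async function discoverPort(opts: {
  candidatePorts: number[];
  probeTimeoutMs: number;
}): Promise<number | undefined> {
  for (const port of opts.candidatePorts) {
    try {
      // AbortSignal.timeout bounds each probe so a dead port fails fast.
      const res = await fetch(`http://127.0.0.1:${port}/`, {
        signal: AbortSignal.timeout(opts.probeTimeoutMs),
      });
      if (res.ok) return port;
    } catch {
      // Unreachable or timed-out port: try the next candidate.
    }
  }
  return undefined;
}

test("explicit candidatePorts bypasses slow OS discovery", async () => {
  const server = createServer((_req, res) => res.end("ok"));
  await new Promise<void>((resolve) => server.listen(0, () => resolve()));
  const address = server.address();
  const port = typeof address === "object" && address ? address.port : 0;

  const started = Date.now();
  const found = await discoverPort({ candidatePorts: [port], probeTimeoutMs: 500 });
  const elapsed = Date.now() - started;

  expect(found).toBe(port);
  // Only the HTTP probe is measured, so the original 2000ms bound holds.
  expect(elapsed).toBeLessThan(2000);
  await new Promise<void>((resolve) => server.close(() => resolve()));
});
```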
Summary
Fixes CI stability issues observed on the current `stable` branch:

- … `HEAD`, so health checks no longer need to probe with `POST`.
- Stream chunks are now persisted before `step_completed`. The Nuxt failure had `readableStreamWorkflow` complete before stream chunks were persisted, so the client observed an empty stream.
- `writeToolOutputToUI` started but never recorded `step_completed`, leaving the workflow running until Vitest timed out; stuck stream writes now fail/retry instead of pinning the step forever.
- The Astro lane timed out in `healthCheck()` even though the following CLI health check passed, which showed the health-check API could hang outside its advertised timeout (see the sketch after this list).
- The Vite failure had `addTenWorkflow` runs complete in ~5s, then `workflow inspect --withData` was killed by the harness' old 20s subprocess timeout while fetching/decrypting remote run data.
- Timing assertions no longer rely on `Date.now()` values returned from replayed workflow code. The Express failure showed `run_started` to `wait_completed` was >10s, but the replayed return value measured only the latter portion of the event stream.
- The `addTenWorkflow` test timeout is now aligned with observed Vercel prod queue/cold-start latency, where the workflow completed after the old 60s test timeout.
- Includes a changeset for the touched packages.
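For the health-check timeout item, a minimal TypeScript sketch of the deadline-bounding idea. `healthCheck(..., { timeout })`, `world.queue()`, and `readFromStream()` are named in the CI notes below; their shapes here, and `healthWorkflow`, are illustrative assumptions, not the package's actual API:

```ts
// Hedged sketch: bound every phase of a health check with one shared deadline.
interface World {
  queue(workflowName: string): Promise<{ runId: string }>;
}

// Placeholder for the real stream read, which polls the run's output stream.
async function readFromStream(runId: string): Promise<string> {
  return `ok:${runId}`;
}

function withDeadline<T>(p: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const bound = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} exceeded health-check timeout (${ms}ms)`)),
      ms,
    );
  });
  return Promise.race([p, bound]).finally(() => clearTimeout(timer));
}

export async function healthCheck(world: World, { timeout }: { timeout: number }) {
  const deadline = Date.now() + timeout;
  const remaining = () => Math.max(1, deadline - Date.now());
  // Before the fix, only response polling was bounded, so queue() or
  // readFromStream() could hang past the advertised timeout.
  const { runId } = await withDeadline(world.queue("healthWorkflow"), remaining(), "queue");
  return withDeadline(readFromStream(runId), remaining(), "readFromStream");
}
```

The design point is that all phases draw down one shared deadline, so the advertised timeout bounds the whole call rather than only the final poll.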
CI notes
- Run 25894395244 failed only `E2E Vercel Prod Tests (nitro)` plus the aggregate required check. The failed `webhookWorkflow` run `wrun_01KRMJH5E2V02Y9KC4NJ0EHB0P` was `completed`; its event timeline reached `run_completed` at +5.3s, so the remaining failure was in the webhook HTTP response path, not workflow execution.
- Run 25895081248 fixed Nitro, and all local e2e jobs passed. The only app-specific failure was `E2E Vercel Prod Tests (vite)`, where both failed `addTenWorkflow` runs were already `completed` (`wrun_01KRMKS6SH17MSH902CATGVGYH` and `wrun_01KRMKTGK9AYX2S228DAWD4GHV` reached `run_completed` in ~5s). The failure was the e2e harness killing `workflow inspect --withData` with `SIGTERM` after 20s.
- Run 25898693068 fixed Vite. Remaining failures were `E2E Vercel Prod Tests (express)` and `E2E Vercel Prod Tests (nitro)`: Express failed `sleepingWorkflow` because replayed `Date.now()` returned a 5509ms delta while the event timeline showed `run_started` at +0.5s and `wait_completed` at +11.3s (see the sketch after this list); Nitro failed `hookDisposeTestWorkflow` because workflow 2 started before workflow 1 had disposed the shared token, producing a correct `hook_conflict`.
- Run 25902464337 fixed Vite, Express, and Nitro. The remaining failure was `E2E Vercel Prod Tests (nuxt)`: `readableStreamWorkflow` returned an empty stream even though the step/run completed; the step event timeline completed in ~1s while the stream-producing step should take ~10s, showing `step_completed` was racing ahead of return stream serialization.
- Run 25905802133 fixed Nuxt. Remaining failures were `E2E Local Dev Tests (hono - stable)` and `E2E Local Prod Tests (nitro - stable)`, both failing `hookWorkflow` with `HookNotFoundError` after the fixed 5s hook-registration sleep, plus `E2E Vercel Prod Tests (sveltekit)`, where the DurableAgent `single tool call` test timed out while run `wrun_01KRN8AVPVXZEGPVV2Z61KW1MY` was still running and its event timeline stopped at the `writeToolOutputToUI` `step_started`.
- Run 25913362448 fixed the Hono/Nitro hook races and the SvelteKit stream hang. The remaining failure was `E2E Vercel Prod Tests (astro)`: the direct queue-based health check timed out after 60s, while the following CLI health check passed in ~3s. That exposed that `healthCheck(..., { timeout })` did not bound `world.queue()` or `readFromStream()` before response polling.
- Run 25913963665 on `5daac2c87` passed, including the previously failing Hono/Nitro hook lanes, the SvelteKit DurableAgent lane, the Astro prod health-check lane, and the aggregate required check: https://github.com/vercel/workflow/actions/runs/25913963665
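To make the Express finding concrete, a hedged sketch of deriving durations from the persisted event timeline instead of `Date.now()` values replayed inside workflow code. The `WorkflowEvent` shape and `elapsedBetween` helper are assumptions; only the event names `run_started` and `wait_completed` come from the notes above:

```ts
// Hedged sketch: measure elapsed time from the run's event timeline, which
// records wall-clock progress, rather than from replayed Date.now() values,
// which (per the Express failure) can measure only part of the run.
interface WorkflowEvent {
  type: string;
  timestampMs: number;
}

function elapsedBetween(events: WorkflowEvent[], from: string, to: string): number {
  const start = events.find((e) => e.type === from);
  const end = events.find((e) => e.type === to);
  if (!start || !end) {
    throw new Error(`timeline is missing ${!start ? from : to}`);
  }
  return end.timestampMs - start.timestampMs;
}

// Usage against the Express case: the timeline showed ~10.8s between
// run_started (+0.5s) and wait_completed (+11.3s), while replayed Date.now()
// reported only a 5509ms delta.
// expect(elapsedBetween(events, "run_started", "wait_completed")).toBeGreaterThan(10_000);
```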
Validation

- `fnm exec --using v22.18.0 pnpm exec biome check packages/core/src/runtime/helpers.ts packages/core/src/runtime/helpers.test.ts packages/core/e2e/e2e.test.ts packages/world-vercel/src/streamer.ts packages/world-vercel/src/streamer.test.ts .changeset/stable-ci-e2e-cleanup.md` (only pre-existing warnings in `e2e.test.ts`)
- `fnm exec --using v22.18.0 pnpm vitest run packages/core/src/runtime/helpers.test.ts packages/world-vercel/src/streamer.test.ts packages/core/src/runtime/step-handler.test.ts packages/core/src/writable-stream.test.ts packages/core/src/step/writable-stream.test.ts`
- `fnm exec --using v22.18.0 pnpm --filter @workflow/core --filter @workflow/world-vercel build`
- `git diff --check`
- … `@workflow/utils` tests, builder package builds, SvelteKit/Astro production builds, Fastify dev rebuild smoke coverage, and Nitro Vercel preset builds.